Information Theory Basics: Entropy, Compression, and Shannon's Theorem
Shannon entropy, the source coding theorem, compression limits, Huffman coding, and why you can never compress truly random data.
Terminology
| Term | Definition |
|---|---|
| Information (Shannon) | The surprise of an event: $I(x) = -\log_2 P(x)$ bits; rare events carry more information than common ones |
| Entropy ($H$) | The expected information content of a random variable: $H(X) = -\sum p(x) \log_2 p(x)$; measures average uncertainty |
| Source coding theorem | Shannon's theorem stating that the average code length for a source cannot be less than its entropy $H$; entropy is the compression limit |
| Prefix-free code | A code where no codeword is a prefix of another, enabling unambiguous decoding; Huffman codes are prefix-free |
| Huffman coding | A greedy algorithm that builds an optimal prefix-free code by assigning shorter codewords to more frequent symbols |
| Mutual information | $I(X;Y) = H(X) - H(X|Y)$: how much knowing $Y$ reduces uncertainty about $X$ |
| Channel capacity | The maximum rate at which information can be reliably transmitted over a noisy channel, measured in bits per use |
| Kolmogorov complexity | The length of the shortest program that produces a given string; incompressible strings have Kolmogorov complexity close to their length |
| Lossless compression | Compression where the original data can be perfectly reconstructed; bounded below by entropy |
What & Why
Claude Shannon's 1948 paper "A Mathematical Theory of Communication" founded information theory and changed the world. It answers a fundamental question: what is the absolute limit of data compression?
The answer is entropy. If a source produces symbols with known probabilities, the entropy $H$ tells you the minimum average number of bits per symbol needed to encode the source losslessly. You cannot beat entropy, no matter how clever your compression algorithm is.
This has immediate practical consequences. When someone claims a compression algorithm that "compresses any file," they are wrong. A counting argument proves it: there are $2^n$ possible $n$-bit strings but only $2^n - 1$ strings shorter than $n$ bits, so no lossless scheme can shrink every input. Truly random data is incompressible because it has maximum entropy.
How It Works
Shannon Entropy
For a discrete random variable $X$ with possible values $\{x_1, x_2, \ldots, x_n\}$ and probabilities $\{p_1, p_2, \ldots, p_n\}$:

$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
Example: A fair coin has $H = -(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{2}\log_2\frac{1}{2}) = 1$ bit. Maximum uncertainty.
Example: A biased coin with $P(\text{heads}) = 0.9$ has $H = -(0.9\log_2 0.9 + 0.1\log_2 0.1) \approx 0.47$ bits. Less uncertainty, more compressible.
Entropy is maximized when all outcomes are equally likely (maximum uncertainty) and minimized (zero) when one outcome is certain.
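The two coin examples can be verified in a few lines of Python (a minimal sketch; `entropy` is an illustrative name, not a library function):

```python
import math

def entropy(probabilities):
    """H(X) = -sum(p * log2(p)) in bits; zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits
```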
Shannon's Source Coding Theorem
For a source with entropy $H$:
- No lossless code can achieve an average code length of less than $H$ bits per symbol
- There exists a prefix-free code whose average length lies between $H$ and $H + 1$ bits per symbol
This means entropy is the fundamental compression limit. Huffman coding gets close to this limit, and arithmetic coding can get arbitrarily close.
Why Random Data Cannot Be Compressed
A string of $n$ truly random bits has entropy $H = n$ bits. The source coding theorem says you need at least $n$ bits to encode it. No compression is possible.
Counting argument: There are $2^n$ possible $n$-bit strings. If compression mapped every string to a shorter string, you would need $2^n$ distinct shorter strings, but there are only $\sum_{k=0}^{n-1} 2^k = 2^n - 1$ strings shorter than $n$ bits. By the pigeonhole principle, at least one string cannot be compressed.
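The limit is easy to observe empirically: a general-purpose compressor such as zlib cannot shrink random bytes, while highly redundant input collapses. A quick sketch (the input sizes are illustrative):

```python
import os
import zlib

random_data = os.urandom(100_000)   # maximum-entropy input
redundant_data = b"A" * 100_000     # near-zero-entropy input

random_compressed = zlib.compress(random_data, 9)
redundant_compressed = zlib.compress(redundant_data, 9)

print(len(random_compressed))      # slightly LARGER than 100000 (headers + block overhead)
print(len(redundant_compressed))   # a few hundred bytes at most
```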
Huffman Coding
Huffman coding builds an optimal prefix-free code by greedily merging the two least-frequent symbols:
Example: for four symbols with probabilities $0.4, 0.3, 0.2, 0.1$, the greedy merges yield codeword lengths $1, 2, 3, 3$. Average code length: $0.4 \times 1 + 0.3 \times 2 + 0.2 \times 3 + 0.1 \times 3 = 1.9$ bits/symbol.
Entropy: $H \approx 1.85$ bits/symbol, so Huffman comes within $0.05$ bits of the theoretical limit.
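This example can be reproduced by tracking leaf depths during the greedy merges, without building an explicit tree (a self-contained sketch):

```python
import heapq
import math

probs = [0.4, 0.3, 0.2, 0.1]

# Each heap entry is (subtree probability, indices of the leaves in that subtree).
heap = [(p, [i]) for i, p in enumerate(probs)]
heapq.heapify(heap)
depth = [0] * len(probs)             # codeword length of each leaf
while len(heap) > 1:
    p1, leaves1 = heapq.heappop(heap)
    p2, leaves2 = heapq.heappop(heap)
    for i in leaves1 + leaves2:      # every leaf in a merged subtree gains one bit
        depth[i] += 1
    heapq.heappush(heap, (p1 + p2, leaves1 + leaves2))

avg_length = sum(p * d for p, d in zip(probs, depth))
source_entropy = -sum(p * math.log2(p) for p in probs)
print(depth)                         # [1, 2, 3, 3]
print(round(avg_length, 2))          # 1.9 bits/symbol
print(round(source_entropy, 2))      # 1.85
```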
Complexity Analysis
| Algorithm | Compression Ratio | Time | Optimality |
|---|---|---|---|
| Huffman coding | $H \le L < H + 1$ | $O(n \log n)$ build, $O(n)$ encode | Optimal among prefix-free codes |
| Arithmetic coding | $\approx H + \epsilon$ | $O(n)$ | Approaches entropy for long sequences |
| LZ77/LZ78 (gzip family) | Approaches $H$ on long, stationary inputs | $O(n)$ | Asymptotically optimal; adapts to source statistics without knowing them in advance |
| No compression (identity) | $n$ bits for $n$-bit input | $O(n)$ | Baseline |
Key formula for the binary symmetric channel with crossover probability $p$: the capacity is $C = 1 - H_b(p)$, where $H_b(p) = -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy function.
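A quick numeric sketch of the binary symmetric channel capacity $C = 1 - H_b(p)$ (`bsc_capacity` is an illustrative name):

```python
import math

def bsc_capacity(p):
    """C = 1 - Hb(p) for a binary symmetric channel with crossover probability p."""
    if p in (0.0, 1.0):              # Hb(0) = Hb(1) = 0: the channel is deterministic
        return 1.0
    hb = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return 1.0 - hb

print(bsc_capacity(0.0))             # 1.0: a noiseless channel carries 1 bit per use
print(bsc_capacity(0.5))             # 0.0: the output is independent of the input
print(round(bsc_capacity(0.11), 2))  # ~0.5: roughly half a bit per use survives
```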
Implementation
Runnable Python versions of the entropy and Huffman routines:

```python
import heapq
import math

def shannon_entropy(probabilities):
    """Entropy in bits: H = -sum(p * log2(p)); zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

class Node:
    def __init__(self, freq, symbol=None, left=None, right=None):
        self.freq = freq
        self.symbol = symbol        # set only on leaves
        self.left = left
        self.right = right

def build_huffman_tree(symbols, frequencies):
    """Greedily merge the two least-frequent subtrees until one tree remains."""
    # The tie-breaking counter keeps heapq from ever comparing Node objects.
    heap = [(f, i, Node(f, s)) for i, (s, f) in enumerate(zip(symbols, frequencies))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, Node(f1 + f2, left=left, right=right)))
        counter += 1
    return heap[0][2]

def build_code_table(node, prefix="", table=None):
    """Walk the tree; left edges append '0', right edges append '1'."""
    if table is None:
        table = {}
    if node.symbol is not None:               # leaf
        table[node.symbol] = prefix or "0"    # single-symbol alphabet edge case
    else:
        build_code_table(node.left, prefix + "0", table)
        build_code_table(node.right, prefix + "1", table)
    return table

def huffman_encode(root, text):
    """Concatenate the codeword for each symbol of the input."""
    table = build_code_table(root)
    return "".join(table[symbol] for symbol in text)
```
Real-World Applications
- File compression: gzip (LZ77 + Huffman), zstd (LZ77 + Huffman + FSE), and brotli all use entropy coding as their final stage; the entropy of the data determines the compression limit
- Image and video codecs: JPEG uses Huffman coding, H.264/H.265 use context-adaptive binary arithmetic coding (CABAC), and PNG uses DEFLATE (LZ77 + Huffman)
- Communication systems: Shannon's channel coding theorem tells engineers the maximum data rate for a given signal-to-noise ratio, guiding the design of 5G, Wi-Fi, and satellite links
- Machine learning: cross-entropy loss is directly derived from information theory; minimizing cross-entropy is equivalent to maximizing the likelihood of the model's predictions
- Cryptography: a good encryption scheme produces ciphertext with maximum entropy (indistinguishable from random), making it incompressible and unpredictable
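The cross-entropy connection can be made concrete: encoding data drawn from $p$ with a code built for a wrong model $q$ costs $H(p, q) = -\sum_x p(x) \log_2 q(x) \ge H(p)$ bits per symbol. A small sketch (the function name is illustrative):

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) = -sum(p * log2(q)): average bits when data ~ p but the code assumes q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
print(cross_entropy_bits(p, p))           # 1.0: equals H(p) when the model is right
print(cross_entropy_bits(p, [0.9, 0.1]))  # ~1.74: a wrong model wastes bits
```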
Key Takeaways
- Shannon entropy $H(X) = -\sum p(x) \log_2 p(x)$ measures the average information content and sets the fundamental limit for lossless compression
- The source coding theorem proves you cannot compress below entropy; Huffman and arithmetic coding approach this limit
- Truly random data has maximum entropy and cannot be compressed; the pigeonhole principle makes this a mathematical certainty
- Huffman coding is optimal among prefix-free codes; arithmetic coding can get arbitrarily close to entropy for long sequences
- Information theory underpins compression, communication, machine learning (cross-entropy loss), and cryptography