Word Embeddings: How King Minus Man Plus Woman Equals Queen
From one-hot vectors to Word2Vec and GloVe, exploring how the distributional hypothesis lets us learn dense vector representations where arithmetic on words captures meaning.
Terminology
| Term | Definition |
|---|---|
| Word Embedding | A dense, low-dimensional vector representation of a word, typically 100 to 300 dimensions, learned from co-occurrence patterns in text |
| Distributional Hypothesis | "You shall know a word by the company it keeps" (Firth, 1957): words appearing in similar contexts have similar meanings |
| One-Hot Encoding | A sparse vector of length $V$ with a single 1 at the word's index and 0 everywhere else; no similarity information between words |
| Word2Vec | A family of two architectures (CBOW and Skip-gram) that learn word embeddings by predicting words from their context or vice versa |
| CBOW (Continuous Bag of Words) | A Word2Vec variant that predicts the center word from the average of its surrounding context word vectors |
| Skip-gram | A Word2Vec variant that predicts surrounding context words given the center word, performing better on rare words |
| GloVe (Global Vectors) | An embedding method that factorizes the global word co-occurrence matrix, combining the benefits of count-based and prediction-based methods |
| Negative Sampling | A training trick that replaces the expensive softmax over the full vocabulary with a binary classification between true context words and randomly sampled "negative" words |
| Cosine Similarity | The cosine of the angle between two vectors: $\frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}$, used to measure semantic similarity between embeddings |
What & Why
One-hot vectors treat every word as equally different from every other word. "Cat" and "dog" are just as distant as "cat" and "democracy." This is a terrible representation for any task that requires understanding meaning, because it encodes zero semantic information.
Word embeddings solve this by mapping each word to a dense vector in a continuous space where similar words are close together. The key insight is the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. "Cat" and "dog" both appear near "pet," "food," "vet," so their learned vectors end up nearby.
The most striking property of these embeddings is that vector arithmetic captures semantic relationships. The vector for "king" minus "man" plus "woman" lands near "queen." This is not programmed in; it emerges from the statistics of word co-occurrence. This discovery (Mikolov et al., 2013) demonstrated that simple neural networks trained on raw text could learn structured representations of meaning.
How It Works
From One-Hot to Dense Vectors
A one-hot vector for vocabulary size $V = 50{,}000$ is a 50,000-dimensional vector with a single 1. An embedding maps this to a dense vector of dimension $d = 300$. The embedding matrix $W \in \mathbb{R}^{V \times d}$ stores one row per word. Looking up a word's embedding is just selecting its row from $W$, which is equivalent to multiplying the one-hot vector by $W$.
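A minimal sketch of this lookup in NumPy (the vocabulary index `1234` for "cat" is a hypothetical placeholder):

```python
import numpy as np

V, d = 50_000, 300  # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, d)).astype(np.float32)  # embedding matrix

word_to_index = {"cat": 1234}      # hypothetical vocabulary mapping
cat_vec = W[word_to_index["cat"]]  # lookup = selecting one row of W
print(cat_vec.shape)               # (300,)
```

Because the one-hot vector has a single 1, the matrix product one-hot times $W$ returns exactly that row, so real implementations skip the multiplication and index directly.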
Skip-gram Architecture
Given a center word, predict the surrounding context words within a window of size $c$. For the sentence "the cat sat on the mat" with window $c = 2$ and center word "sat," the training pairs are: (sat, the), (sat, cat), (sat, on), (sat, the).
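Generating these pairs is a simple sliding-window pass over the token list; a sketch:

```python
def skipgram_pairs(tokens, c=2):
    """Generate (center, context) training pairs within a window of size c."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            # skip the center word itself and positions outside the sentence
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

sentence = "the cat sat on the mat".split()
print([p for p in skipgram_pairs(sentence) if p[0] == "sat"])
# [('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

Note that "the" appears twice as a context word for "sat": duplicates are kept, since frequent co-occurrence is exactly the signal being learned.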
The objective maximizes the average log probability of context words given center words:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$

With negative sampling, $\log p(w_O \mid w_I)$ is approximated as:

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

where $v_w$ and $v'_w$ are the target and context embeddings of word $w$, $\sigma$ is the sigmoid function, and the $k$ negative samples are drawn from a noise distribution $P_n(w) \propto f(w)^{3/4}$, with $f(w)$ the unigram frequency of $w$.
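The 3/4 exponent in the noise distribution flattens raw frequencies, so rare words are sampled as negatives more often than their counts alone would allow. A toy illustration with three words:

```python
import numpy as np

counts = np.array([100.0, 10.0, 1.0])  # toy word frequencies
p_n = counts ** 0.75                   # the 3/4-power smoothing
p_n /= p_n.sum()                       # normalize into a distribution

print(p_n.round(3))   # [0.827 0.147 0.026]
# vs. raw frequencies: [0.901 0.090 0.009] -- the rare word's
# sampling probability roughly triples under the 3/4 power.
```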
CBOW Architecture
CBOW reverses the direction: given the context words, predict the center word. It averages the context word vectors and passes the result through a single linear layer. CBOW is faster to train and works well for frequent words, while Skip-gram handles rare words better.
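The CBOW forward pass described above can be sketched in a few lines (toy sizes and random initialization, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 8                                  # toy vocabulary and dimension
W_in = rng.normal(scale=0.01, size=(V, d))   # input (context) embeddings
W_out = rng.normal(scale=0.01, size=(V, d))  # output embeddings

context_ids = [0, 1, 3, 4]                   # indices of the surrounding words
h = W_in[context_ids].mean(axis=0)           # average the context vectors
scores = W_out @ h                           # single linear layer
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
```

Averaging the context vectors is what makes CBOW a "bag" of words: word order within the window is discarded, which is part of why it trains faster than Skip-gram.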
GloVe
GloVe (Pennington et al., 2014) takes a different approach. It first builds a global co-occurrence matrix $X$ where $X_{ij}$ counts how often word $i$ appears near word $j$. Then it learns vectors such that:

$$w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$$

fit by minimizing the weighted least-squares objective $J = \sum_{i,j} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$, where the weighting function $f$ down-weights rare co-occurrences and caps the influence of very frequent ones.
This combines the statistical power of count-based methods with the quality of prediction-based embeddings.
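GloVe weights each co-occurrence entry with a function that down-weights rare pairs and saturates for frequent ones; a sketch of that weighting and the per-entry loss (toy values, not the reference implementation):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x/x_max)^alpha, capped at 1 for x >= x_max."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss_entry(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one non-zero co-occurrence count X_ij."""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

# a single co-occurrence entry with toy vectors and biases
loss = glove_loss_entry(np.full(3, 0.1), np.full(3, 0.1), 0.0, 0.0, x_ij=10.0)
```

Because the sum runs only over non-zero entries of $X$, training cost scales with the number of observed co-occurrence pairs rather than with $V^2$.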
Complexity Analysis
| Operation | Time | Space | Notes |
|---|---|---|---|
| Skip-gram training (1 epoch) | $O(T \cdot c \cdot k)$ | $O(V \cdot d)$ | $T$ = tokens, $c$ = window, $k$ = negative samples, $d$ = dimensions |
| GloVe training (1 epoch) | $O(|X_{\neq 0}|)$ | $O(V^2)$ worst case | Iterates over non-zero co-occurrence entries |
| Embedding lookup | $O(1)$ | $O(d)$ | Index into the embedding matrix |
| Nearest neighbor search | $O(V \cdot d)$ | $O(1)$ | Brute-force cosine similarity against all words |
The embedding matrix for $V = 400{,}000$ words at $d = 300$ dimensions requires $V \times d \times 4$ bytes $\approx 480$ MB in float32. Pre-trained GloVe and Word2Vec models are typically distributed as these large matrix files.
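The memory figure follows directly from the matrix dimensions; a quick check:

```python
V, d = 400_000, 300
bytes_needed = V * d * 4        # float32 = 4 bytes per value
print(bytes_needed / 1e6)       # 480.0 (MB)
```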
Implementation
ALGORITHM TrainSkipGram(corpus, d, c, k, learningRate, epochs)
INPUT: corpus: list of tokens, d: embedding dimension, c: window size,
k: negative samples, learningRate: float, epochs: integer
OUTPUT: W: embedding matrix [V x d]
BEGIN
V <- number of unique tokens in corpus
W <- RANDOM MATRIX [V x d] with small values // target embeddings
W' <- RANDOM MATRIX [V x d] with small values // context embeddings
freqs <- frequency of each word in corpus
noiseDist[i] <- freqs[i]^0.75 / SUM(freqs[j]^0.75 for all j), for each word i
FOR epoch FROM 1 TO epochs DO
FOR t FROM 0 TO LENGTH(corpus) - 1 DO
centerIdx <- INDEX OF corpus[t]
FOR j FROM -c TO c, j != 0 DO
IF t + j < 0 OR t + j >= LENGTH(corpus) THEN CONTINUE
contextIdx <- INDEX OF corpus[t + j]
// Positive sample: push center and context together
score <- DOT(W[centerIdx], W'[contextIdx])
grad <- (SIGMOID(score) - 1) * learningRate
centerOld <- COPY(W[centerIdx]) // cache so both updates use the same gradient
W[centerIdx] <- W[centerIdx] - grad * W'[contextIdx]
W'[contextIdx] <- W'[contextIdx] - grad * centerOld
// Negative samples: push center away from random words
FOR i FROM 1 TO k DO
negIdx <- SAMPLE from noiseDist
score <- DOT(W[centerIdx], W'[negIdx])
grad <- SIGMOID(score) * learningRate
centerOld <- COPY(W[centerIdx])
W[centerIdx] <- W[centerIdx] - grad * W'[negIdx]
W'[negIdx] <- W'[negIdx] - grad * centerOld
END FOR
END FOR
END FOR
END FOR
END FOR
RETURN W
END
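The update step above translates almost line-for-line into NumPy. A minimal sketch of one negative-sampling update with toy sizes (random initialization, hypothetical indices):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 10, 5, 3, 0.05
W = rng.normal(scale=0.01, size=(V, d))      # target embeddings
W_ctx = rng.normal(scale=0.01, size=(V, d))  # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(center, context, neg_ids):
    """One update: label 1 for the true context, 0 for each negative."""
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in neg_ids]:
        g = (sigmoid(W[center] @ W_ctx[idx]) - label) * lr
        center_old = W[center].copy()  # cache before updating, as in the pseudocode
        W[center] -= g * W_ctx[idx]
        W_ctx[idx] -= g * center_old

sgd_step(center=2, context=5, neg_ids=rng.integers(0, V, size=k))
```

Each positive update moves the center and context rows toward each other (their dot product grows), while each negative update pushes the center row away from a randomly sampled word.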
ALGORITHM WordAnalogy(W, wordA, wordB, wordC, vocab)
INPUT: W: embedding matrix, wordA/B/C: strings (e.g., "king","man","woman")
OUTPUT: closest word to vector(wordA) - vector(wordB) + vector(wordC)
BEGIN
target <- W[INDEX(wordA)] - W[INDEX(wordB)] + W[INDEX(wordC)]
bestWord <- ""
bestSim <- -INFINITY
FOR EACH word IN vocab DO
IF word = wordA OR word = wordB OR word = wordC THEN CONTINUE
sim <- DOT(target, W[INDEX(word)]) / (NORM(target) * NORM(W[INDEX(word)]))
IF sim > bestSim THEN
bestSim <- sim
bestWord <- word
END IF
END FOR
RETURN bestWord // e.g., "queen"
END
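The analogy routine is easy to vectorize: compute cosine similarity against all rows at once instead of looping. A runnable sketch on a tiny hand-built vocabulary (the 2-D coordinates are illustrative, not learned):

```python
import numpy as np

def analogy(W, vocab, a, b, c):
    """Return the vocab word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = W[idx[a]] - W[idx[b]] + W[idx[c]]
    sims = (W @ target) / (np.linalg.norm(W, axis=1) * np.linalg.norm(target))
    for w in (a, b, c):          # exclude the three query words
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# toy embeddings with a built-in "royalty" axis (x) and "male" axis (y)
vocab = ["king", "man", "woman", "queen", "apple"]
W = np.array([
    [0.9, 0.9],   # king  = royal + male
    [0.1, 0.9],   # man   = male
    [0.1, 0.1],   # woman = neither
    [0.9, 0.1],   # queen = royal, not male
    [0.5, 0.5],   # apple = unrelated
])
print(analogy(W, vocab, "king", "man", "woman"))  # queen
```

Excluding the query words matters in practice: without it, the nearest neighbor of king - man + woman in real embedding spaces is often "king" itself.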
Real-World Applications
- Sentiment analysis: Pre-trained word embeddings provide a semantic foundation for classifiers, letting models generalize from "great" to "excellent" without seeing both in training data
- Named entity recognition: Embedding features help sequence labelers identify entities by capturing distributional similarity between known and unknown entity words
- Machine translation: Cross-lingual embeddings map words from different languages into a shared vector space, enabling translation by nearest-neighbor lookup
- Recommendation systems: Item2Vec applies the Skip-gram idea to user interaction sequences, learning item embeddings where co-purchased products are nearby
- Biomedical NLP: BioWordVec trains embeddings on PubMed abstracts, capturing domain-specific relationships like drug-disease associations
- Analogy and bias detection: Vector arithmetic reveals societal biases in training data (e.g., "doctor" closer to "man" than "woman"), enabling debiasing research
Key Takeaways
- Word embeddings map words to dense vectors where semantic similarity corresponds to geometric proximity, solving the fundamental limitation of one-hot encoding
- The distributional hypothesis ("a word is known by the company it keeps") is the theoretical foundation: co-occurrence statistics encode meaning
- Skip-gram predicts context from a center word and excels at rare words; CBOW predicts the center word from context and trains faster on frequent words
- GloVe factorizes the global co-occurrence matrix, combining count-based and prediction-based approaches into a single objective
- Vector arithmetic captures semantic relationships (king - man + woman = queen), an emergent property that was not explicitly trained for